Senior Researcher/Team Lead in Data Services for the Social Sciences @ GESIS
Previously PostDoc in Communication at VU Amsterdam, University of Amsterdam, and European New School of Digital Studies
Interested in:
Computational Social Science
Open and reproducible science
Automated Content Analysis
Hybrid Media Systems and Information Flows
Protest and Democracy
Experience:
R user since 2015
R package developer since 2017
Worked on several packages for text analysis, API access and web scraping (spacyr, quanteda.textmodels, LexisNexisTools, paperboy, traktok, rollama, amcat4-r, atrrr, rwhatsapp and more)
In RStudio, go to “Create a project” (the symbol in the top left corner). (If you do not see this symbol, you need to install Git or GitHub Desktop first.) Then select “Version Control”:
Or if you are using the command line, you can simply type:
I tend to jump back and forth between the slides and RStudio
Slides
See the html file in each session folder
Source
See the qmd file in each session folder
Use the course material
Pull the latest version at the beginning of each session
Make a copy of the qmd file and name it, e.g., “1_Introduction_to_Computing_notes.qmd”
Use this file to make notes, for example by adding comments using this syntax <!-- your comment --> (RStudio shortcut Ctrl + Shift + C / Command + Shift + C on macOS)
Alternatively, open the slides in a browser and press e to export them to PDF (and take notes with a PDF reader)
This makes sure you don’t get any git conflicts when you pull after I have updated something in the material in the meantime.
Your turn (Exercises 1)
Download the course material and open the RStudio project or folder in your IDE
Open the file readme.qmd from the file explorer of your IDE
Execute the final code chunk to install all packages we will need for this course
R refresher
Packages
R organises its functions in packages (even base functions)
Most packages must be installed (once) and attached (every new session)
install.packages("tidyverse")
library(tidyverse)
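A minimal sketch of the install-once, attach-every-session pattern. The check below only installs packages that are not yet available. The object names (pkgs, installed, to_install) and the choice of packages are assumptions for illustration; both packages ship with R, so nothing is downloaded here.

```r
# Packages we want to use (illustrative choice; both ship with R)
pkgs <- c("stats", "tools")

# requireNamespace() returns TRUE if a package is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

# Install only what is missing (needs to happen once per machine)
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Attach the packages (needs to happen in every new session)
for (pkg in pkgs) {
  library(pkg, character.only = TRUE)
}
```

Replacing the entries of pkgs with the packages you actually need gives a script that is safe to run repeatedly.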
Accessing Functions
If you do not want to attach an entire package, you can use the Double Colon to only use a specific function:
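A minimal sketch of the double colon in action. The example uses file_ext() from the tools package, which ships with R but is not attached by default; the file name is only an illustration.

```r
# Call a single function from a package without attaching the package:
# package::function()
ext <- tools::file_ext("1_Introduction_to_Computing_notes.qmd")
ext
```

The same syntax also makes it explicit which package a function comes from when two attached packages export the same name (for example stats::filter() and dplyr::filter()).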
https://www.r-bloggers.com/ (collection of personal blog posts related to R – so quality varies)
ChatGPT/Claude/Gemini/DeepSeek
library(askgpt)
log_init()
mean[1:10] # deliberate error: a function cannot be subset with []
askgpt("What is wrong with my last command?")
Functions
Functions are easy to define in R:
new_fun <- function(x = 1) {
  out <- c(
    sum(x),
    mean(x),
    median(x)
  )
  return(out)
}
new_fun()
[1] 1 1 1
vec <- c(1:10)
new_fun(x = vec)
[1] 55.0 5.5 5.5
Going through this bit by bit:
new_fun: The name of the new function (convention: use something descriptive; don’t use . or CamelCase but _ if you have multiple words)
<-: The assignment operator.
function(x): Define arguments and defaults here.
{}: Everything inside the curly brackets is the body of the function (code you are running when calling the function).
return(): All objects created inside the function are destroyed as soon as the function finishes running, except what is passed to return() (which can be implicit: without return(), the value of the last evaluated expression is returned).
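To illustrate the last point, here is a small sketch of an implicit return. The function range_width() and its local objects are hypothetical names, and the final check assumes there is no other object called lowest in your session.

```r
# Hypothetical helper: the last evaluated expression is returned implicitly
range_width <- function(x) {
  lowest <- min(x)   # local objects, they only exist while the function runs
  highest <- max(x)
  highest - lowest   # no return() needed
}

width <- range_width(c(3, 8, 15))
width

# The local objects are gone once the function has finished
exists("lowest")
```

Whether to write return() explicitly is mostly a matter of style; being explicit can make longer functions easier to read.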
Data
In R, data is stored in objects. We will learn about different ways to do so tomorrow!
Loops
For loops
Iterate over a vector:
x <- NULL
for (i in 1:10) {
  message(i)
  x <- c(x, i)
}
x
[1] 1 2 3 4 5 6 7 8 9 10
for: This is how you start the loop
i: This is the variable which takes a different value in each iteration of the loop
in: separates the variable from the vector
1:10: The vector over which to iterate
{}: The expression inside the curly brackets is evaluated once for each value in the vector; i takes a different value each run
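The same building blocks work for any vector, not just 1:10. The sketch below loops over the positions of a character vector; the object names are made up for illustration. Note that it pre-allocates the result instead of growing it with c() as in the example above, which is a different (and for long vectors usually faster) technique.

```r
words <- c("data", "science", "R")

# Pre-allocate the result instead of growing it in every iteration
n_chars <- integer(length(words))

# seq_along() creates the index vector 1, 2, 3
for (i in seq_along(words)) {
  n_chars[i] <- nchar(words[i])
}
n_chars
```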
Especially for simple operations and statistics, base is still great.
model <- lm(hp ~ mpg, data = df) # simple linear regression
summary(model)
Call:
lm(formula = hp ~ mpg, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-59.26 -28.93 -13.45  25.65 143.36

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   324.08      27.43  11.813 8.25e-13 ***
mpg            -8.83       1.31  -6.742 1.79e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.95 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
base R
base also has a plotting system:
plot(df$mpg, df$hp, col = "blue", ylab = "horse power", xlab = "miles per gallon", main = "Simple linear regression")
abline(model, col = "red")
text(30, 300, "We can add some text", col = "red")
Tidyverse
What is it?
The official description: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures”.
The principle that gives the tidyverse its name is that of tidy data: “Each variable forms a column. Each observation forms a row.” (see tidyr vignette for more info)
Seems trivial at first but as a principle can be quite consequential (e.g., it means that most object types are ignored and data.frames are very dominant)
Some coding principles attached to it (e.g., the pipe, functions as verbs that build on each other)
The pipe
Originally %>% (from the magrittr package), now also native in R as |> (since R 4.1.0)
Forwards the result of one function to another
Makes for much more readable code:
transform(aggregate(. ~ cyl, data = subset(mtcars, hp > 100), FUN = function(x) round(mean(x), 2)), kpl = mpg * 0.4251)
You can make this more readable by creating intermediate objects:
data1 <- subset(mtcars, hp > 100) # take subset of original data
data2 <- aggregate(. ~ cyl, data = data1, FUN = function(x) round(mean(x), 2)) # aggregate by taking rounded mean
transform(data2, kpl = mpg * 0.4251) # convert miles per gallon to kilometer per liter
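For comparison, here is a sketch of the same three steps written with the native pipe. It assumes R 4.2.0 or newer, because it uses the _ placeholder to pass the piped data to the data argument of aggregate(); the object name car_data is an assumption.

```r
# Same steps as a pipeline: each result is forwarded to the next function.
# The placeholder _ marks where the piped data goes when it is not the
# first argument.
car_data <- mtcars |>
  subset(hp > 100) |> # take subset of original data
  aggregate(. ~ cyl, data = _, FUN = function(x) round(mean(x), 2)) |> # rounded mean by cyl
  transform(kpl = mpg * 0.4251) # convert miles per gallon to kilometer per liter
car_data
```

The pipeline reads top to bottom in the order the steps are executed and does not leave intermediate objects in the environment.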
Note: You can interject the View() command at any line in a complicated pipeline to see the intermediate result in a spreadsheet-style data viewer.
Special package ggplot2
Completely overhauls the plotting system in R
IMO: the best plotting system in any programming/data science language
Implements the “Grammar of Graphics”: a language for describing custom plots instead of relying on predefined plotting functions
The specific logic makes it harder to learn than other packages, but you can express essentially any plot in it (I highly recommend using “ggplot2: Elegant Graphics for Data Analysis” to learn the package instead of individual tutorials)
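As a rough illustration of the layered grammar, the sketch below rebuilds a scatter plot with a regression line for mtcars. It assumes ggplot2 is installed; the object name p and the styling choices are assumptions, not part of the original slides.

```r
library(ggplot2)

# Build the plot layer by layer: data + aesthetic mapping + geoms + labels
p <- ggplot(data = mtcars, mapping = aes(x = mpg, y = hp)) +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, colour = "red", se = FALSE) +
  labs(
    x = "miles per gallon",
    y = "horse power",
    title = "Simple linear regression"
  )
p
```

Each + adds one component, so removing or swapping a line changes exactly one aspect of the plot.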
Exercises 2
Run ggplot(data = mtcars). What do you see and why?
In the function pb_collect() from paperboy, what do the arguments ignore_fails and connections do?
Write a function that takes a numeric vector of miles per gallon consumption data and converts it to kilometer per liter. If anything other than a numeric vector is entered, the function should display an error (hint: see ?stop).
In the code below, check the sizes of the intermediate objects with object.size().
file_link <- "https://raw.githubusercontent.com/shawn-y-sun/Customer_Analytics_Retail/main/purchase%20data.csv"
df <- read.csv(file_link)
filtered_df <- df[df$Age >= 50, ]
aggregated_df <- aggregate(filtered_df$Quantity, by = list(filtered_df$Day), FUN = sum)
names(aggregated_df) <- c("day", "total_quantity")
aggregated_df[order(aggregated_df$total_quantity, decreasing = TRUE)[1:5], ]
How could the code above be improved if you only want the final result, want the code to be readable, and care about memory usage?
Literate Programming
Background
“The language in which we express our ideas has a strong influence on our thought processes.”
― Donald Ervin Knuth, Literate Programming
When analysing data in R, a cornerstone of a good workflow is documenting what you are doing.
The whole point of doing data analysis in a programming language rather than a point and click tool is reproducibility.
Yet if your code does not run after a while and you don’t understand what you were doing when writing the code, it’s as if you had done your whole analysis in Excel!
Advantages
This is where literate programming has a lot of advantages:
Enhanced Documentation: Literate programming combines code and documentation in a single, integrated document. This approach encourages researchers to write clear and comprehensive explanations of their code, making it easier for others (and even themselves) to understand the working of the code, (research) design choices, and logic.
Improved Readability: By structuring code and documentation in a literate programming style, the resulting code becomes more readable and coherent. The narrative flow helps readers follow the thought process and intentions of the programmer, leading to improved comprehension and maintainability.
Modular and Reusable Code: Literate programming emphasizes the organization of code into coherent and reusable chunks, as writers come to think of them as similar to paragraphs in a text, where each chunk develops one specific idea.
Collaboration and Communication: Literate programming enhances collaboration among developers by providing a common platform to discuss, share, and review code. The narrative style fosters effective communication, allowing team members to understand the codebase more easily and collaborate more efficiently.
Extensibility and Maintenance: Well-documented literate programs are typically easier to extend and maintain over time. The clear explanation of choices and functionality helps yourself and others in the future to make decisions about modifications, enhancements, and bug fixes.
Reproducibility and accountability: When you save the rendered output of an analysis, you know exactly how a table or plot was created. If there are several versions, you can always turn to the rendered document and check which data, code and package versions were used to do your analysis (at least when documents were written in a specific way).
Quarto (and its predecessor R Markdown) were designed to make it easy for you to make the most of these advantages. We have already been using these tools throughout the workshop and I hope this made you more familiar with them.
Exercises 4
Use the function report_template() from my package jbgtemplates to start a new report
Add some simple analysis in it and render
Play around with the formats and produce at least a PDF and Word output of your document
Think about how the structure of the document enhances reproducibility
Some other tricks
The worst default setting in RStudio
The default setting to ask whether to save the current session is horrible. Eventually you click “yes, save the data”, which creates a file called .RData. This file is loaded whenever you open RStudio! This makes RStudio slow and can lead to unexpected behaviour, even when you delete all objects in your environment with rm(list = ls()).
Exercises
Change that setting NOW and look for the .RData file in your project and home directory.
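If you prefer to look from the R console, here is a small sketch that checks both locations. It assumes your project folder is the current working directory; the object names are made up for illustration.

```r
# Candidate locations: the current project (working directory) and home
candidates <- file.path(c(getwd(), path.expand("~")), ".RData")

# TRUE means a saved workspace file exists at that location
found <- file.exists(candidates)
names(found) <- candidates
found
```

Any file reported as TRUE can be deleted once you are sure you do not need the objects stored in it.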
Git some Version Control
Git is an extensive application, too much to go through in full. But you do not need all of its functionality to make efficient use of its main promise: keeping track of what is happening in your projects and giving you the ability to revert to an older state of your project.
This screenshot shows the last months of my PhD, when I was furiously working on integrating comments from different people. Not only did git show me my progress nicely, I also did not have to worry about accidentally deleting anything that might still prove valuable. Especially towards the end, I often removed sections or copied them to other chapters. Whenever I could not find a specific section, I went back to the last commit when I could still remember where it was.
Additionally, GitHub offers some nice features to organise and plan projects around issues.
Here you can note down remaining problems and keep track of your progress. It keeps your head free for other things!
Many of you did not come to class to just scrape exercise pages. You probably had some initial data and/or research question in mind. Please write a short abstract on what you want to accomplish with the web scraping and data management skills you will learn here. The abstract should include:
general goal
research question
(preliminary) assessment of what data you need, what data can be found on the website and what potential research questions you have in mind.
Deadline: Tuesday midnight
Reminder: Social Programme
| Date | Event | Time | Venue |
|---|---|---|---|
| Monday 7 July | Meet and Greet - in person | 19:00 start | SU Bar |
| Tuesday 8 July | Staff briefing - hybrid | 18:30 start | NTC |
| Tuesday 8 July | Climbing (participants) | 18:30 start | Sports Centre |
| Wednesday 9 July | Harold Clarke Speaker Series - hybrid | 18:45 - 20:00 | EBS |
| Thursday 10 July | Sports Night | 18:30 - 20:30 | Sports Centre |
| Friday 11 July | Wivenhoe Pub Run | 18:30 start | Wivenhoe pubs |
| Monday 14 July | SU Bar Quiz | 19:00 start | SU Bar |
| Tuesday 15 July | Sports Night | 18:30 - 20:30 | Sports Centre |
| Wednesday 16 July | Harold Clarke Speaker Series - hybrid | 18:30 | EBS |
| Thursday 17 July | Farewell Party Karaoke | 20:30 - 23:30 | SU Bar |
Wrap Up
Save some information about the session for reproducibility.
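One common way to do this in base R is sessionInfo(). The sketch below prints it and, as an optional extra, writes it to a text file; the file name session_info.txt is a hypothetical choice.

```r
# Collect R version, operating system, and attached package versions
info <- sessionInfo()
info

# Optionally write it to a text file next to the analysis
# ("session_info.txt" is a hypothetical file name)
writeLines(capture.output(info), "session_info.txt")
```

Placing the call in the last chunk of a Quarto document means the rendered output always records the versions it was produced with.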